Introduction

Using Twitter, we’ll try to find out how people feel about Christmas.

Aims:

- Show how to access APIs and text-mine the resulting data
- Train and use a glmnet model to predict the sentiment of tweets
- Use leaflet to visualise the data

See Paul’s previous talk for a good 101 on text mining: https://github.com/RUMgroup/Text-mining, and Reka’s previous talk on leaflet: https://github.com/RUMgroup/leaflet_tutorial

Load (installing if necessary) the libraries needed for this work

packages <- c("twitteR","streamR","ROAuth","DT","glmnet","text2vec","maps",
              "leaflet","rgdal","raster","maptools","RColorBrewer")
p <- sapply(packages, function(x) {
  # attach each package, installing it first if it is missing
  if (!require(x, character.only = TRUE)) {
    install.packages(x)
    library(x, character.only = TRUE)
  }
})
## Loading required package: twitteR
## Loading required package: streamR
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Loading required package: ROAuth
## Loading required package: DT
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
## Loading required package: text2vec
## Loading required package: maps
## Loading required package: leaflet
## Loading required package: rgdal
## Loading required package: sp
## rgdal: version: 1.2-16, (SVN revision 701)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 1.10.1, released 2013/08/26
##  Path to GDAL shared files: /usr/share/gdal/1.10
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
##  Path to PROJ.4 shared files: (autodetected)
##  Linking to sp version: 1.2-4
## Loading required package: raster
## Loading required package: maptools
## Checking rgeos availability: FALSE
##      Note: when rgeos is not available, polygon geometry     computations in maptools depend on gpclib,
##      which has a restricted licence. It is disabled by default;
##      to enable gpclib, type gpclibPermit()
## Loading required package: RColorBrewer

Collecting tweets

We’ll use the Twitter API to access Twitter data in an R-friendly way.

We first need to connect to the Twitter API, which requires an account and a registered app to generate API keys and access tokens. Create an app here:

#need a dev twitter account - can make your own easily
API_Key <- ""
API_Secret <- ""
Access_Token <- ""
Access_Secret <-  ""

#authenticate in headless mode
setup_twitter_oauth(API_Key, API_Secret,Access_Token,Access_Secret)

#Let's search past tweets
#has a tendency to be rate limited - wait 15 mins between each big search - takes a couple of mins to complete
tweetsTrump <-searchTwitter(searchString="@realDonaldTrump",n=100,lang="en")

saveRDS(tweetsTrump,file = "data/tweetsTrump.RDS")

searchTwitter returns a list of status objects containing the tweet text and meta-information. Reference classes look quite odd in R, more like Java or Python: methods are called on the object, and objects are mutable.
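To see what "mutable" means here, a minimal reference class of our own makes the contrast with ordinary R copy-on-modify semantics clear (the `Counter` class below is made up for illustration, not part of twitteR):

```r
# A toy reference class, analogous in style to twitteR's 'status' class
Counter <- setRefClass("Counter",
  fields  = list(n = "numeric"),
  methods = list(
    increment = function() {
      n <<- n + 1   # modifies the field in place - reference semantics
    },
    getN = function() {
      return(n)
    }
  ))

c1 <- Counter$new(n = 0)
c2 <- c1          # c2 is a reference to the SAME object, not a copy
c1$increment()
c2$getN()         # the change made via c1 is visible through c2
```

With an ordinary R list, `c2 <- c1` would have taken a copy and `c2` would be unaffected; with a reference class, both names point at the same object.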

tweetsTrump <- readRDS("data/tweetsTrump.RDS")


#look at the class of the first element in the returned list
class(tweetsTrump[[1]])
## [1] "status"
## attr(,"package")
## [1] "twitteR"
#take a peek at the class structure
str(tweetsTrump[[1]])
## Reference class 'status' [package "twitteR"] with 17 fields
##  $ text         : chr "@realDonaldTrump \"Not\"  funny old school reference"
##  $ favorited    : logi FALSE
##  $ favoriteCount: num 0
##  $ replyToSN    : chr "realDonaldTrump"
##  $ created      : POSIXct[1:1], format: "2017-12-01 09:40:26"
##  $ truncated    : logi FALSE
##  $ replyToSID   : chr "924649059520073730"
##  $ id           : chr "936530418324475904"
##  $ replyToUID   : chr "25073877"
##  $ statusSource : chr "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"
##  $ screenName   : chr "matthew_darden"
##  $ retweetCount : num 0
##  $ isRetweet    : logi FALSE
##  $ retweeted    : logi FALSE
##  $ longitude    : chr(0) 
##  $ latitude     : chr(0) 
##  $ urls         :'data.frame':   0 obs. of  4 variables:
##   ..$ url         : chr(0) 
##   ..$ expanded_url: chr(0) 
##   ..$ dispaly_url : chr(0) 
##   ..$ indices     : num(0) 
##  and 53 methods, of which 39 are  possibly relevant:
##    getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet,
##    getLatitude, getLongitude, getReplyToSID, getReplyToSN, getReplyToUID,
##    getRetweetCount, getRetweeted, getRetweeters, getRetweets,
##    getScreenName, getStatusSource, getText, getTruncated, getUrls,
##    initialize, setCreated, setFavoriteCount, setFavorited, setId,
##    setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN,
##    setReplyToUID, setRetweetCount, setRetweeted, setScreenName,
##    setStatusSource, setText, setTruncated, setUrls, toDataFrame,
##    toDataFrame#twitterObj
#get the screen name
tweetsTrump[[1]]$getScreenName()
## [1] "matthew_darden"
#how many times has it been retweeted?
tweetsTrump[[1]]$getRetweetCount()
## [1] 0
#Get the text
tweetsTrump[[1]]$getText()
## [1] "@realDonaldTrump \"Not\"  funny old school reference"
tweetsTrumpText <- sapply(tweetsTrump,function(x) x$getText())
tweetsTrumpText <-iconv(tweetsTrumpText, "latin1", "ASCII", "")

datatable(as.data.frame(tweetsTrumpText),rownames = F)

We are limited in how far back we can search and how much data we can gather.

Let’s listen instead - we need to authenticate using ROAuth, which will open a webpage and give you a PIN to enter in R

requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

my_oauth <- OAuthFactory$new(consumerKey=API_Key,
                             consumerSecret=API_Secret, requestURL=requestURL,
                             accessURL=accessURL, authURL=authURL)

#should open up browser and give you a pin to type into R
my_oauth$handshake()

Listen to tweets about Christmas

#Listen to all English-language Christmas tweets and store as JSON
filterStream(file.name="tweets_keyword", track=c("Christmas","Xmas"),tweets=200000,oauth=my_oauth,language="en")
tweets.Christmas <- parseTweets("tweets_keyword", verbose = TRUE)
tweets.Christmas <- tweets.Christmas [!duplicated(tweets.Christmas$text),]
saveRDS(tweets.Christmas,file="data/tweets.Christmas.RDS")

Look at the tweets

tweets.Christmas <- readRDS("data/tweets.Christmas.RDS")
#make sure the text is in ASCII
tweets.Christmas$text <-iconv(tweets.Christmas$text, "latin1", "ASCII", "")
#datatable(tweets.Christmas[1:100,],rownames = F)

Tweet Sentiment

We wish to rate the tweets on happiness. Doing this by hand would be time-consuming, so we can use a classification model trained on 1.6 million tweets to predict the sentiment.

See makeSentimentModel.R for details of how the model can be created - it takes ~40 mins on a single CPU
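The model is a penalised logistic regression fitted to a tf-idf document-term matrix. A rough sketch of how such a model could be trained follows - this is not the exact contents of makeSentimentModel.R, and `train` is a hypothetical data frame with `text` and `sentiment` (0 = negative, 1 = positive) columns:

```r
library(text2vec)
library(glmnet)

# tokenise the labelled tweets
it_train <- itoken(train$text,
                   preprocessor = tolower,
                   tokenizer = word_tokenizer)

# build a vocabulary and the vectoriser we will reuse on new tweets
vocab <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)

# document-term matrix, weighted by tf-idf
dtm_train <- create_dtm(it_train, vectorizer)
dtm_train_tfidf <- fit_transform(dtm_train, TfIdf$new())

# cross-validated logistic regression on the labels
glmnet_classifier <- cv.glmnet(x = dtm_train_tfidf, y = train$sentiment,
                               family = "binomial", type.measure = "auc")

# save both pieces - the vectoriser must be kept so that new tweets
# are mapped onto the same columns as the training matrix
saveRDS(glmnet_classifier, "data/glmnet_classifier.RDS")
saveRDS(vectorizer, "data/vectorizer.RDS")
```

This is why both `glmnet_classifier.RDS` and `vectorizer.RDS` are loaded below: the vectoriser guarantees the new tweets produce a document-term matrix with the same columns the model was trained on.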

#load the model
sentimentModel <- readRDS("data/glmnet_classifier.RDS")

#load the vectoriser function
vectorizer <- readRDS("data/vectorizer.RDS")

# preprocessing and tokenization
it_tweets <- itoken(tweets.Christmas$text,
                    preprocessor = tolower,
                    tokenizer = word_tokenizer,
                    progressbar = TRUE)

# creating vocabulary and document-term matrix
dtm_tweets <- create_dtm(it_tweets, vectorizer)
## Warning in cpp_vocabulary_corpus_create(vocabulary$term, attr(vocabulary, :
## '.Random.seed' is not an integer vector but of type 'NULL', so ignored
##   |=================================================================| 100%
# transforming data with tf-idf
dtm_tweets_tfidf <- fit_transform(dtm_tweets, TfIdf$new())

# predict probabilities of positiveness
preds_tweets <- predict(sentimentModel, dtm_tweets_tfidf, type = 'response')[ ,1]

# adding rates to initial dataset
tweets.Christmas$sentiment <- preds_tweets 
tweets.Christmas.filt <- tweets.Christmas[order(tweets.Christmas$sentiment),]
tweets.Christmas.filt <- tweets.Christmas.filt[c(1:100,(nrow(tweets.Christmas.filt)-100):nrow(tweets.Christmas.filt)),]

tweets.Christmas.filt$text <-iconv(tweets.Christmas.filt$text, "UTF-8", "ISO-8859-1", "")
#look at the text and the sentiment
datatable(tweets.Christmas.filt[,c("text","sentiment")],rownames = F)

How do people generally feel about Christmas?

boxplot(tweets.Christmas$sentiment)

Infer the latitude and longitude of the tweets

#get longitude and latitude for tweets from the location data
data(world.cities)

#modified function from - http://biostat.jhsph.edu/~jleek/code/twitterMap.R
findLatLon <- function(loc){
  latlon = NA
  cont = NA
  
  # Asia = 1, Africa = 2, North America = 3, South America = 4, Australia/New Zealand = 5, Europe = 6
  continents = matrix(NA,nrow=length(unique(world.cities[,2])),ncol=2)
  continents[,1] = unique(world.cities[,2])
  continents[1:10,2] = c(1,1,1,2,1,1,1,1,1,1)
  continents[11:20,2]= c(1,1,2,1,1,2,1,2,2,2)
  continents[21:30,2] = c(2,1,6,6,6,6,6,6,6,6)
  continents[31:40,2] = c(6,6,6,6,2,4,4,1,2,1)
  continents[41:50,2] = c(4,6,1,4,6,1,3,1,6,6)
  continents[51:60,2] = c(3,2,4,2,6,1,6,1,3,2)
  continents[61:70,2] = c(1,2,2,2,3,6,3,3,6,6)
  continents[71:80,2] = c(1,1,2,6,3,4,3,4,6,1)
  continents[81:90,2] = c(3,3,3,2,2,6,6,6,6,4)
  continents[91:100,2] = c(2,5,2,2,3,1,1,1,1,1)
  continents[101:110,2] = c(1,2,1,1,1,3,2,5,1,6)
  continents[111:120,2] = c(1,6,1,1,2,6,1,1,6,2)
  continents[121:130,2] = c(6,6,6,1,1,3,4,3,4,2)
  continents[131:140,2] = c(6,6,2,2,1,1,1,4,1,1)
  continents[141:150,2] = c(1,2,2,1,1,1,4,6,6,2)
  continents[151:160,2] = c(4,1,1,1,1,2,4,6,2,2)
  continents[161:170,2] = c(1,2,2,1,6,2,1,1,6,1)
  continents[171:180,2] = c(1,1,1,2,6,2,2,6,1,1)
  continents[181:190,2] = c(2,6,2,1,6,6,3,3,3,3)
  continents[191:200,2] = c(2,2,2,2,3,2,3,2,3,1)
  continents[201:210,2] = c(3,2,2,2,2,2,2,1,6,2)
  continents[211:220,2] = c(1,3,1,6,2,4,3,6,3,4)
  continents[221:230,2] = c(1,1,1,3,2,3,3,6,1,6)
  continents[231:232,2] = c(2,1)
  
  
  # Get the first element of the location
  # firstElement = strsplit(loc,"[^[:alnum:]]")[[1]][1]
  firstElement = strsplit(loc,",")[[1]][1]
  if(is.na(firstElement)){firstElement="zzzzzzzzz"}
  
  # See if it is a city
  tmp = grep(firstElement,world.cities[,1],fixed=TRUE)
  tmp2 = grep(firstElement,state.name,fixed=TRUE)
  tmp3 = grep(firstElement,world.cities[,2],fixed=TRUE)
  
  if(length(tmp) == 1){
    latlon = world.cities[tmp,c(5,4)]
    cont = continents[which(world.cities[tmp,2]==continents[,1]),2]
  }else if(length(tmp) > 1){
    tmpCities = world.cities[tmp,]
    latlon = tmpCities[which.max(tmpCities$pop),c(5,4)]
    cont = continents[which(tmpCities[which.max(tmpCities$pop),2]==continents[,1]),2]
  }else if(length(tmp2) == 1){
    latlon = c(state.center$x[tmp2],state.center$y[tmp2])
    cont = 3
  }else if(length(tmp3) > 0){
    tmpCities = world.cities[tmp3,]
    latlon = tmpCities[which.max(tmpCities$pop),c(5,4)]
    cont = continents[which(tmpCities[which.max(tmpCities$pop),2]==continents[,1]),2]
  }
  
  #return(list(latlon=latlon,cont=as.numeric(cont)))
  return(latlon)
}

tweets.Christmas$location <-iconv(tweets.Christmas$location, "latin1", "ASCII", "")
locs<-as.data.frame(tweets.Christmas$location)
locs_lat <-apply(locs,1,findLatLon)
saveRDS(locs_lat,"data/loc_lat.RDS")

Map the tweets locations

locs_lat <- readRDS("data/loc_lat.RDS")
tweets.Christmas$longitude <- unlist(lapply(locs_lat,"[",1))
tweets.Christmas$latitude <- unlist(lapply(locs_lat,"[",2))
tweets.Christmas<- tweets.Christmas[ !is.na(tweets.Christmas$longitude),]

m <- leaflet(tweets.Christmas[1:500,]) %>%
 addProviderTiles("CartoDB.Positron") %>%
  addMarkers(lng=~longitude, lat=~latitude)
m

Map of sentiment

Colour each area by mean sentiment

#get a shape file
regions <- getData('GADM', country='GB', level=2)

#find the intersect of the lat-long with the shapefile polygons

p <- SpatialPointsDataFrame(coords = tweets.Christmas[,c("longitude","latitude")],data=data.frame(ID=paste0("tweet",1:nrow(tweets.Christmas)),sentiment=tweets.Christmas$sentiment))

proj4string(p)<-CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
regions <- spTransform(regions, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"))

#intersect
res <- over(regions, p,returnList = T)


#mean sentiment per polygon
res <- lapply(res,function(x) {
  if(nrow(x)<1){
   return(0.5)
  } else{
    mean(x[,"sentiment"])
  }})

#assign the mean sentiment to the polygon
regions$sentiment <-unlist(res)


#What are the least happy Christmas-tweeting places? (sorted ascending by sentiment)
head(regions[ order(regions$sentiment,decreasing = F),])
##     OBJECTID ID_0 ISO         NAME_0 ID_1           NAME_1 ID_2
## 113      113  242 GBR United Kingdom    2 Northern Ireland  113
## 129      129  242 GBR United Kingdom    2 Northern Ireland  129
## 122      122  242 GBR United Kingdom    2 Northern Ireland  122
## 135      135  242 GBR United Kingdom    2 Northern Ireland  135
## 149      149  242 GBR United Kingdom    3         Scotland  149
## 152      152  242 GBR United Kingdom    3         Scotland  152
##                NAME_2 HASC_2 CCN_2 CCA_2           TYPE_2        ENGTYPE_2
## 113            Antrim  GB.AN    NA               District         District
## 129             Larne  GB.LR    NA               District         District
## 122         Coleraine  GB.CL    NA               District         District
## 135      Newtownabbey  GB.NW    NA               District         District
## 149 East Renfrewshire  GB.ER    NA       Unitary District Unitary District
## 152           Falkirk  GB.FK    NA       Unitary District Unitary District
##     NL_NAME_2 VARNAME_2 sentiment
## 113                     0.1486539
## 129                     0.3972214
## 122                     0.4834109
## 135                     0.4837799
## 149                     0.4910031
## 152                     0.4933216
#make a colour palette - purple=sad,green=happy 
col <- colorNumeric("PiYG",domain=c(0,1))

#map the shape file coloured by the mean sentiment
m <- leaflet(regions) %>%   addProviderTiles("Stamen.Toner") %>%
  addPolygons( stroke=F,color = ~col(sentiment))
m